OTTERS: a powerful TWAS framework leveraging summary-level reference data

Most existing TWAS tools require individual-level eQTL reference data and thus are not applicable to summary-level reference eQTL datasets. The development of TWAS methods that can harness summary-level reference data is valuable to enable TWAS in broader settings and enhance power due to increased reference sample size. Thus, we develop a TWAS framework called OTTERS (Omnibus Transcriptome Test using Expression Reference Summary data) that adapts multiple polygenic risk score (PRS) methods to estimate eQTL weights from summary-level eQTL reference data and conducts an omnibus TWAS. We show that OTTERS is a practical and powerful TWAS tool by both simulations and application studies.

Major comments: 1. The authors performed simulations to assess OTTERS alongside FUSION and various PRS approaches. However, I found it somewhat limited by the decision to use the genetic variation around a single gene, rather than sample a random gene region, for each simulation. Similarly, null simulations relied on a single instance of model fitting with permutation.
2. I think it would be helpful to understand the circumstances of when certain PRS approaches perform better, by diving a bit more into the eQTLGen results and analyzing if there are architectural patterns associated with model selection. For example, genes with greater cis-heritability, mean LDscore, etc. I understand that part of the benefit of OTTERS is to combine this approaches regardless of which performs best individually, but shedding some light on these characteristics may help better understand results.
Minor comments: 1. "Existing TWAS tools require individual-level eQTL reference data […]". While this is largely true, there are MR and SMR related tools that require only summary-level data to make inferences about gene/trait relationships.

Reviewer #2 (Remarks to the Author):
Enclosed is a review of Dai et al's manuscript: "OTTERS: A powerful TWAS framework leveraging summary-level reference data" In this manuscript, the authors present a novel TWAS framework that is able to use eQTL summary statistics to train predictive models of gene expression by repurposing different methods to train polygenic scores. OTTERS then integrates these models with GWAS to detect genetic associations. The authors conducted several simulations to present various genetic architectures where each PGS method provides the most powerful detection of genetic association and conducts a TWAS of cardiovascular disease to report novel finding. Overall, the method is interesting and clever, from the math to the name! However, I have some (minor) concerns about the evaluation of the method. My main concern, which I believe is a considerably important one, is the lack of direct comparison to an expression prediction method that leverages individual-level genotypes in real data. The main premise of my concern is this: if traditional individual-level predictive models of expression at such large sample sizes enable more powerful TWAS that OTTERS, then (a) this should be reported and (b) a duty of eQTL consortia to provide pre-trained weights corresponding to their results. Some of the co-authors of this manuscript are lead investigators for these large initiatives. This is a common practice now, given recent initiatives like OmicsPred (https://www.omicspred.org/). I appreciate that the authors consider this comparison in simulations, but in my experience, real data is often not similar.
Nevertheless, the aggregation of methods that consider different architectures is specifically an aspect of this method that seems intuitively important.
I detail my comments below: RESULTS 1. Line 168 -are these 4 PGS methods the best out of other available methods? Did the authors consider more recent methods like MegaPRS (10.1038/s41467-021-24485-y) or PUMAS (10.1186/s13059-021-02479-9)? 2. Line 196 -as I mention above, I appreciate this comparison back to FUSION in the simulations! These results are very promising! 3. This is my remaining concern with the Results presented. I agree that the OTTERS framework for model training is useful when individual level genotypes are not available in the eQTL dataset. But the authors do not compare OTTERS to a traditional individual-level TWAS model (e.g., PrediXcan/FUSION). If individual-level TWAS methods outperform OTTERS, it seems reasonable to expect large QTL consortia to provide pre-computed predictive models of gene expression for the public, along with their eQTL summary statistics. I urge the authors to consider including a comparison to individual-level TWAS models in the real data applications in datasets of equal sample size (using whatever method they prefer, at whatever sample sizes, etc). I do not believe computational cost should be an issue: on similar Linux shell as the authors presented, I was able to train an elastic net model using glmnet or a SuSiE model with 32,000 samples and 10,000 features in around 3 minutes. Just a handful of comparisons would be helpful here.

Response to Reviewers
We are grateful for the reviewers' helpful comments, which have enhanced the quality of our manuscript. In addition to addressing all reviewer comments, we updated our TWAS analyses based on eQTLGen summary statistics and UK Biobank 1 (UKBB) GWAS summary data of cardiovascular disease by matching strand flip SNPs between the two datasets. Such matching increased the number of test SNPs used in TWAS and resulted in the identification of more TWAS risk genes than in our initial submission.
Within the manuscript without tracking changes, we denote new or substantially revised material by a vertical line in the left margin.

Reviewer #1 (Remarks to the Author):
Overview: Dai et al present OTTERS, a TWAS framework, to identify gene/trait associations from summary-level GWAS and functional data. Most TWAS pipelines/approaches rely on individual-level data to first build predictive models of gene expression from genetic data, and then integrate those models with downstream GWAS summary data to perform association testing. While this approach has worked well in practice, a limitation is that recent ultra-largescale eQTL data (e.g., eQTLGen) has only released summary statistics, which limits the application of traditional penalized model fitting. OTTERS proposes to leverage polygenic riskscore prediction tools to fit predictive models using the eQTL summary data, rather than individual-level data, and integrate weights with downstream GWAS data. In order to combine the results across many possible predictive approaches, OTTERS also proposes an ACATbased meta-analysis. I found the approach interesting, and the results largely supporting claims. However, I have some comments regarding simulation design.
We would like to thank the reviewer for these encouraging comments.
Major comments: 1. The authors performed simulations to assess OTTERS alongside FUSION and various PRS approaches. However, I found it somewhat limited by the decision to use the genetic variation around a single gene, rather than sample a random gene region, for each simulation. Similarly, null simulations relied on a single instance of model fitting with permutation.
We thank the reviewer for this suggestion. We substantially revised our simulation study based on real whole genome sequencing (WGS) data from the ROS/MAP cohort 2,3 and MSBB study 4 . We divided 14,772 genes within our WGS dataset into five groups according to gene length, and randomly selected 100 genes from each group (500 genes in total). We randomly split the WGS dataset into 568 training (30%) and 1326 testing samples (70%), to mimic a relatively small sample size in the real reference panel for training gene expression imputation models. We considered scenarios with 2 different proportions of causal cis-eQTL, = (0.001,0.01), and 3 different proportions of gene expression variance explained by causal eQTL, ℎ 2 = (0.01, 0.05, 0.1). For each gene in each scenario, we simulated 10 replicates of gene expression. For each simulated gene expression, we generated 10 sets of GWAS Z-scores to perform a total of 50,000 TWAS simulations per scenario. Let denotes the assumed GWAS sample size, and ℎ 2 denotes the amount of phenotypic variance explained by simulated GReX. We set ℎ 2 = 0.025 and considered = (200K, 300K, 400K, 500K) for scenarios with ℎ 2 = 0.01, = (25K, 50K, 75K, 100K) for scenarios with ℎ 2 = 0.05, and = (10K, 20K, 30K, 40K) for scenarios with ℎ 2 = 0.1.
Based on the new simulation results with 500 randomly selected genes, we demonstrated that the Stage I training method with the optimal test 2 and TWAS power depended on the underlying genetic architecture of gene expression ( ) as well as gene expression heritability (ℎ 2 ) ( Figure 2 in the manuscript, which is reproduced below). In situations where true cis-eQTLs were sparse ( = 0.001) and the gene expression heritability was small (ℎ 2 = 0.01), P+T (0.05) method performed the best with the highest TWAS power among all individual methods. When gene expression heritability increased (ℎ 2 = 0.05 or 0.1) within this sparse eQTL model, P+T (0.001) and PRS-CS were generally the optimal methods. For a less sparse model with = 0.01, SDPR and PRS-CS generally performed best among the individual methods. Relative to individual methods, however, we found that combining the TWAS p-values based on the four PRS training methods together for analysis in our OTTERS framework obtained the highest power across all scenarios.

Figure 2: Test (A) and TWAS power (B) comparison in simulation studies
To evaluate the type I error, we picked one simulated replicate per gene from the scenario with ℎ 2 = 0.1 and = 0.001 , simulated 2 × 10 3 phenotypes from (0,1) , and permuted the eQTL weights for TWAS to perform a total of 10 6 null simulations. OTTERS was well calibrated as shown by quantile-quantile (QQ) plots of TWAS p-values ( Figure S1, which is reproduced below). We observed that OTTERS had well-controlled type I error for stringent significance levels between 10 −4 and 2.5 × 10 −6 (Table S1, which is reproduced below) that are typically utilized in TWAS. For more modest significance thresholds ( = 10 −2 ), we noted that OTTERS had slightly inflated type-I error rate; this result is consistent with the findings of the original ACAT work (Liu et al. AJHG 104: 410) 5 that showed the Cauchy-distribution-based approximation may not be accurate for larger p-values when correlation among tests is strong (see Appendix A of Liu et al).
Since most TWAS analyses need to adjust for analysis of ~20K genes and therefore consider significance levels between 10 −5 and 10 −6 , we do not believe this result to diminish the practical impact of OTTERS but we do suggest that users interpret more modest p-values with caution.
We have updated all simulation results in the Results section (pages 6-9, lines 152-220) in the revised manuscript.  2. I think it would be helpful to understand the circumstances of when certain PRS approaches perform better, by diving a bit more into the eQTLGen results and analyzing if there are architectural patterns associated with model selection. For example, genes with greater cisheritability, mean LDscore, etc. I understand that part of the benefit of OTTERS is to combine this approaches regardless of which performs best individually, but shedding some light on these characteristics may help better understand results.
In Figure S5, we show the eQTL weights for gene SIDT2, which was a significant risk gene identified by both PRS-CS and SDPR, and had p-values < 10 −4 by other methods. Compared to lassosum, SDPR had more significant GWAS SNPs colocalized with eQTLs having relatively large weights in the test region, and PRS-CS had more non-significant GWAS SNPs colocalized with eQTLs having zero weights. Compared to the P+T methods, SDPR and PRS-CS based on a multivariate regression model modeled LD among all test SNPs, and thus estimated eQTL weights leading to significant TWAS findings.
In Figure S6, we provide the results of the gene EDN3, which was only identified by P+T methods (p-values ≤ 9.15 × 10 −8 ). Compared to P+T methods, SDPR (p-value = 5.9 × 10 −3 ) and PRS-CS (p-value = 0.0158) had fewer significant GWAS SNPs colocalized with eQTLs that had relatively large weights in the test region, while lassosum (p-value = 8.6 × 10 −6 ) assigned relatively large weights to more non-significant GWAS SNPs.
In Figure S7, we provide results for gene LINC01093, which was only identified by lassosum. For this gene, SDPR and PRS-CS estimated near-zero weights for most SNPs with significant GWAS p-values in the test region. Most significant GWAS SNPs did not have eQTL test pvalues < 0.001 or 0.05, and were thus filtered out by P+T methods. lassosum was the only method that produced relatively large eQTL weights that co-localized with GWAS significant SNPs.
We include these figures in the supplemental material ( Figures S5-S7) and mention them in the Results section (page 13, lines 312-328). We thank the reviewer for pointing out this sentence. We now discuss how SMR can be used as an alternative analysis method besides TWAS in the Discussion (page 16, lines 392 -396).

Reviewer #2 (Remarks to the Author):
Enclosed is a review of Dai et al's manuscript: "OTTERS: A powerful TWAS framework leveraging summary-level reference data" In this manuscript, the authors present a novel TWAS framework that is able to use eQTL summary statistics to train predictive models of gene expression by repurposing different methods to train polygenic scores. OTTERS then integrates these models with GWAS to detect genetic associations. The authors conducted several simulations to present various genetic architectures where each PGS method provides the most powerful detection of genetic association and conducts a TWAS of cardiovascular disease to report novel finding. Overall, the method is interesting and clever, from the math to the name! However, I have some (minor) concerns about the evaluation of the method. My main concern, which I believe is a considerably important one, is the lack of direct comparison to an expression prediction method that leverages individual-level genotypes in real data. The main premise of my concern is this: if traditional individual-level predictive models of expression at such large sample sizes enable more powerful TWAS than OTTERS, then (a) this should be reported and (b) a duty of eQTL consortia to provide pre-trained weights corresponding to their results. Some of the co-authors of this manuscript are lead investigators for these large initiatives. This is a common practice now, given recent initiatives like OmicsPred (https://www.omicspred.org/). I appreciate that the authors consider this comparison in simulations, but in my experience, real data is often not similar.
Nevertheless, the aggregation of methods that consider different architectures is specifically an aspect of this method that seems intuitively important.
We would like to thank the reviewer for the positive comments about OTTERS. Please see our detailed point-to-point responses as follows.
I detail my comments below:
We thank the reviewer for the comment. The main aim of our work was to show that PRS methods can be employed to perform TWAS using summary-level eQTL data and that combining the results of multiple PRS methods together in an optimal omnibus test using ACAT-O can be more powerful than simply using a single PRS method for this purpose. The PRS methods currently utilized in the OTTERS framework are representative of techniques that make different assumptions of the underlying eQTL architecture; thereby ensuring that OTTERS remains powerful across different generating mechanisms. We agree that more recent PRS methods could be added to the OTTERS framework, and could possibly yield improved results, but this additional integration is beyond the scope of this work. We now mention that MegaPRS 6 and PUMAS 7 can be included as additional methods for training gene imputation models and can be incorporated into OTTERS in the Discussion (page 16, lines 389-391).
2. Line 196as I mention above, I appreciate this comparison back to FUSION in the simulations! These results are very promising!
We appreciate the reviewer for recognizing the advantages of OTTERS in our simulation studies.
3. This is my remaining concern with the Results presented. I agree that the OTTERS framework for model training is useful when individual level genotypes are not available in the eQTL dataset. But the authors do not compare OTTERS to a traditional individual-level TWAS model (e.g., PrediXcan/FUSION). If individual-level TWAS methods outperform OTTERS, it seems reasonable to expect large QTL consortia to provide pre-computed predictive models of gene expression for the public, along with their eQTL summary statistics. I urge the authors to consider including a comparison to individual-level TWAS models in the real data applications in datasets of equal sample size (using whatever method they prefer, at whatever sample sizes, etc). I do not believe computational cost should be an issue: on similar Linux shell as the authors presented, I was able to train an elastic net model using glmnet or a SuSiE model with 32,000 samples and 10,000 features in around 3 minutes. Just a handful of comparisons would be helpful here.
We want to thank the reviewer for this suggestion, which we have addressed by running an additional TWAS of cardiovascular disease in the UKBB where, for reference, we used individual-level transcriptomic data available from 574 GTEx V8 whole blood samples. We trained FUSION on this individual data, and then compared results to OTTERS using cis-eQTL summary statistics from these same 574 GTEx V8 whole blood samples coupled to reference LD from GTEx V8 WGS samples. We first compared training 2 of gene expression imputation models trained by FUSION with single summary-level method: P+T (0.001), P+T (0.05), lassosum, SDPR, and PRS-CS. The training 2 is simply the squared Pearson correlation coefficient between imputed GReX and true gene expression in the 574 training samples. We considered trained imputation models with training 2 > 0.01 as "valid" training models, according to previous TWAS methods 8,9 . By comparing training 2 per "valid" GReX imputation model by lassosum versus the other methods ( Figure S10, which is reproduced below), we observed that lassosum had the best overall performance for imputing GReX as it provided the most "valid" models with higher training 2 than FUSION and other single summary-level methods.
Next, we compared FUSION with OTTERS with respect to the TWAS results using the same GWAS summary data of UKBB. We tested 19,653 genes and identified genes with p-values < 2.53 × 10 −6 (Bonferroni corrected significance level) as significant TWAS genes. To identify independently significant TWAS genes, we calculated the 2 (squared correlation) of estimated genetic regulated gene expression (GReX) by lassosum for each pair of genes. For a pair of genes with estimated GReX 2 > 0.5, we only kept the gene with the smaller TWAS p-value as the independently significant gene. OTTERS obtained 34 independently significant TWAS genes, while FUSION identified 21 significant TWAS genes ( Figure S11 and Table S3, which are reproduced below). A total of 14 genes were identified by both FUSION and OTTERS. This comparison between FUSION and OTTERS using the same training and test data demonstrates that OTTERS finds more risk genes than FUSION even when individual-level reference data are available. This finding in real data is consistent with our simulation results.
We report these findings in the Results section (pages 14-15, lines 342-366) and in the supplemental materials (Figures S10-S11 , Table S3). Based on these findings, we don't believe it is necessary for consortia to provide pre-computed predictive models of gene expression for the public. In any event, we feel the production of such models would likely face serious practical hurdles in creation. As consortia like eQTLGen are composed of multiple individual studies, the delivery of these results would require each individual study to run their own predictive model separately (each using the same procedure and QC pipeline) and then somehow combine the results across studies dealing with different sample sizes, different training 2 , and different covariates. In the unlikely scenario a mega-analysis is possible where the individual data from multiple studies can be pooled together for prediction after approval of complicated data-transfer agreements, the potential heterogeneity among studies due to LD, ancestry, or other factors still could lead to inaccurate inference.